A catalog of small proteins from the global microbiome

Small open reading frames (smORFs) shorter than 100 codons are widespread and perform essential roles in microorganisms, where they encode proteins active in several cell functions, including signal pathways, stress response, and antibacterial activities. However, the ecology, distribution and role of small proteins in the global microbiome remain unknown. Here, we construct a global microbial smORFs catalog (GMSC) derived from 63,410 publicly available metagenomes across 75 distinct habitats and 87,920 high-quality isolate genomes. GMSC contains 965 million non-redundant smORFs with comprehensive annotations. We find that archaea harbor more smORFs proportionally than bacteria. We moreover provide a tool called GMSC-mapper to identify and annotate small proteins from microbial (meta)genomes. Overall, this publicly-available resource demonstrates the immense and underexplored diversity of small proteins.

I think the databases NMPfamsDB and SmProt should be mentioned, and the differences to GMSC discussed. I recognise that NMPfamsDB has only very recently been released (and that there is overlap in authors) but I think it could now be useful to include some mention of it.
It may be useful to discuss or mention the fact that many of the ORFs may be from mobile genetic elements such as plasmids or phages.
I would like to see some mention of why the key thresholds (e.g. p<0.05 for RNAcode) were chosen. I appreciate however that space is limited.
Similarly, some more detail of how Prodigal was used could be useful.

# Other comments
Supplementary Fig 3b: label should say "not performed".
~ line 282: I did not find the description of the terminal checking very clear here. Perhaps a diagram could be provided?
Line 284 - the version of Antifam used could be stated. https://interprodocumentation.readthedocs.io/en/latest/antifam.html
Line 398 - "were carried out"

Reviewer #1 (Remarks on code availability): I have briefly surveyed the code provided. It appears to be well written, and providing code for all figures as Jupyter notebooks, with associated data appropriately organised, is excellent practice.

Reviewer #2 (Remarks to the Author): The manuscript entitled "A catalogue of small proteins from the global microbiome" provides a pipeline to generate a large-scale catalogue of smORFs in microbes. The library has the potential to serve as a resource for the microbiome community.

General comments:
The manuscript provides a potentially useful resource of putative microbial smORFs. The resource is fairly well characterized and investigated. However, the manuscript is not easy to follow in terms of how it presents both its methods and results. Many method descriptions are in the Results section or figure captions, while many results are reported in the Methods section. This makes for a hard read of the manuscript. Furthermore, very few criteria used for building this catalogue are justified. The majority of criteria are arbitrary thresholds selected by the authors. Finally, little is done to convince the reader about the validity of the catalogue and its potential use. I provide more specific comments below.

Major comments:
A large number of results are provided in the Methods section. These included, but are not limited to, results provided at line 257, 262, 269, 281, 286, 289, 299, 308, and 313. On the other hand, methods are often described better or repeated in figure captions. A significant reorganization of the text is needed to ease reading.
Moreover, criteria and thresholds for metatranscriptomics, ribo-seq, and metaproteomics smORF confirmation are quite arbitrary. Instead of reporting a single number of confirmed smORFs at a selected threshold, plots showing the number of smORFs passing at varying thresholds would provide a better grasp of the dataset, and would help provide a reasoning behind the choices of the different thresholds.
The smORFs family construction requires a more detailed explanation. What are the sequences that are clustered? Is it that any sequence that has at least one other sequence with which it has a 90% identity and 90% coverage is used as input for the clustering analysis? You could have three sequences named A, B and C, with A and B having a 90% identity, A and C also having a 90% identity, but B and C not having this level of identity. Would these all be grouped together?
The procedure used to evaluate the significance of the clusters appears convoluted, undersampled and arbitrary. Why not use a simple bootstrapping approach to evaluate the robustness of the clusters? This is a lot more standard and typical for such analyses. Also, how are the representative sequences of the clusters determined? This is not clearly described in the methods.
[...] cover bacteria so far and they do not have the same objectives, but some of them are overlapping. Would their approach be applicable here?
The data do not back up the following conclusion stated by the authors: "Archaea have more transmembrane or secreted small proteins than bacteria". First, no evidence is provided that these specific smORFs are translated into proteins. Second, archaea have far fewer data points, and I would assume that if one were to remove a couple of the highest points, which look more like outliers than anything else, the result would no longer be significant. This result appears to be an artefact of the methods used to identify smORFs and transmembrane domains more than anything else. The conclusion of this entire section should be removed or rewritten.
To provide further insights into the validity of the catalogue, I would have expected that sequence conservation would have been directly investigated. One would assume that high-quality predictions are more likely to be functional than low-quality ones. Hence, they should be more likely to be evolutionarily conserved. Is it the case that nucleotides that are part of these high-quality predictions are more conserved than those that are part of lower-quality ones? A fold-enrichment could be provided to yield such an assessment.
In order to provide some insights into the potential applications and discovery potential of the catalogue, it would be interesting to see how these novel smORFs can help identify more peptides and proteins in metaproteomics studies. Most mass spectrometry-based metaproteomics studies will identify proteins using a technique called sequence database search. Providing a set of smORFs not typically included in such sequence database searches could help reveal new proteins never identified in metaproteomics datasets in the past.
No README is provided with the code, making its evaluation extremely difficult.

Minor comments:
Why were 10,000 randomly selected prokaryotic proteins queried using RPS-BLAST?
Versions used should be provided for Python, Pandas, NumPy, and SciPy.
Reviewer #2 (Remarks on code availability): The code was not thoroughly reviewed due to a lack of instructions on how to execute it.

Supplementary Fig. 5 (reproduced here for convenience; compared to the previous version, panel c was added; note that this was previously Sup. Fig. 4). Comparison of reference small protein datasets. (a) Shown is the fraction of smORFs from high-quality predictions that are homologous to reference small protein datasets. (b) The comparison of the proportions of smORFs from human or non-human habitats between homologs or non-homologs to small protein clusters and conserved families from the Sberro human microbiome dataset. (c) Shown is the fraction of GMSC smORFs that are homologous to NMPfamsDB, FesNov families, smProt2, OpenProt2.0, and sORF.org.
It may be useful to discuss or mention the fact that many of the ORFs may be from mobile genetic elements such as plasmids or phages.
Author response: This is an excellent point. As we now make explicit, when estimating taxonomy we mapped to the GTDB database, which only includes prokaryotic genomes. We further mention the possibility that some ORFs may be part of mobile elements in the Results Section "Even conserved small proteins lack functional annotations".

Changes made:
We added the following sentence in Line 126 of Results: "Note that we used the GTDB database, which does not include phage or microeukaryotes." We rewrote the sentence in Line 131 of Results: "Although in some cases, smORFs may be present in plasmids and other mobile elements, we reasoned that multi-genus families would be especially likely to be present in multiple habitats and involved in critical cellular functions."

I would like to see some mention of why the key thresholds (e.g. p<0.05 for RNAcode) were chosen. I appreciate however that space is limited.

[...] smORFs in microbes. The library has the potential to serve as a resource for the microbiome community.

General comments:
The manuscript provides a potentially useful resource of putative microbial smORFs. The resource is fairly well characterized and investigated. However, the manuscript is not easy to follow in terms of how it presents both its methods and results. Many method descriptions are in the Results section or figure captions, while many results are reported in the Methods section. This makes for a hard read of the manuscript. Furthermore, very few criteria used for building this catalogue are justified. The majority of criteria are arbitrary thresholds selected by the authors. Finally, little is done to convince the reader about the validity of the catalogue and its potential use. I provide more specific comments below.
Author response: We thank the reviewer for their appreciation of the usefulness of the resource, while also acknowledging their concerns, which we address below.

Major comments:
A large number of results are provided in the Methods section. These included, but are not limited to, results provided at line 257, 262, 269, 281, 286, 289, 299, 308, and 313. On the other hand, methods are often described better or repeated in figure captions. A significant reorganization of the text is needed to ease reading.
Author response: Thank you for pointing out the unclear parts of our text. We have reorganized our text structure, including Results, Methods, and Figure captions. To make the Methods section more concise and easier to follow, we removed or simplified results there that are already shown in the figures or the Results section.

Changes made (main points, summarized):
We moved the detailed numbers of each step of our pipeline for constructing our catalogue from the Methods section to the figure captions. We removed the detailed numbers from the Methods section when they are shown in the figures. Specifically, in Methods, we removed the number of rescued singletons and non-singletons, which are already shown in Figure 1a, and the number of smORFs that passed each quality test, which is already shown in Supplementary Fig. 3 and the Results section. We moved the detailed results of the significance validation of clusters from the Methods section to the caption of Supplementary Fig. 1a-b.
Moreover, criteria and thresholds for metatranscriptomics, ribo-seq, and metaproteomics smORF confirmation are quite arbitrary. Instead of reporting a single number of confirmed smORFs at a selected threshold, plots showing the number of smORFs passing at varying thresholds would provide a better grasp of the dataset, and would help provide a reasoning behind the choices of the different thresholds.
Author response and Changes made: Unfortunately, there are no well-validated standards in this field. In the case of interpreting the outputs of RNAcode and metaproteomics, we used thresholds previously used in the literature (Sberro et al., 2019; https://doi.org/10.1016/j.cell.2019.07.016; Ma et al., 2022; https://www.nature.com/articles/s41587-022-01226-0), but we acknowledge that this is not a consensus in the field and, furthermore, no analogous examples exist for thresholds applicable to transcriptional or translational data. Therefore, we thank the reviewer for their suggestion, which we implemented: we now add a new Supplementary Fig. 4 (reproduced below) to show how different thresholds lead to different numbers of high-quality predictions. We also make all this information available on the updated website for both download and interactive exploration (Reviewer Figure 1). While we kept our previous thresholds as defaults, users can now choose different combinations of parameters for their queries.
As part of this effort, since we had not saved all the intermediate results, we needed to rerun some of the quality checking. In addition, we now screen on the proteomic coverage value directly, without first reducing it to one decimal place, to make the results more accurate. Given that the results are not completely deterministic, this led to some very minor updates in the resulting high-quality counts.
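
As an illustration of the kind of threshold sweep summarized in Supplementary Fig. 4, the following minimal Python sketch counts how many values fall below each of several cut-offs. The function name `counts_below` and the example P-values are hypothetical; they are not taken from the GMSC pipeline or its data.

```python
from bisect import bisect_left


def counts_below(values, thresholds):
    """Return {threshold: number of values strictly below it}.

    Sorting once and using binary search keeps the sweep cheap even
    when many thresholds are evaluated.
    """
    ordered = sorted(values)
    # bisect_left gives the index of the first element >= t,
    # which equals the number of elements strictly below t.
    return {t: bisect_left(ordered, t) for t in thresholds}


# Hypothetical P-values, for illustration only.
example_p_values = [0.001, 0.02, 0.04, 0.2, 0.6]
sweep = counts_below(example_p_values, [0.01, 0.05, 0.1])
print(sweep)
```

Plotting the resulting counts against the thresholds gives one curve per quality test, which makes it visible how sensitive the number of passing smORFs is to the chosen default.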

Supplementary Fig. 4 (reproduced here for convenience). Effect of different thresholds on quality control. (a) The number of smORFs with high coding potential as estimated by RNAcode, using different P-value thresholds. (b) The number of smORFs with transcriptional evidence, using different thresholds for the minimal number of samples required for detection. (c) The number of smORFs with translational evidence, using different thresholds for the minimal number of samples required for detection. (d) The number of detected smORFs in metaproteomics data, using different thresholds for the required k-mer coverage of each smORF-encoded small protein (Methods).

Author response: We used the standard pipeline Linclust (Steinegger & Söding, 2018; https://doi.org/10.1038/s41467-018-04964-5), which uses a greedy approach, whereby sequences are compared to candidate representatives. Thus, in the reviewer's example, if A was chosen as a potential representative, it would indeed be chosen as a representative for both B and C, even if B and C do not share this level of identity. Due to the very large size of the input databases, such an approach is necessary to keep the computational costs reasonable.
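
The dependence on the chosen representative can be illustrated with a toy greedy clustering in Python. This is only a sketch under simplifying assumptions, not the Linclust implementation: the `identity` function compares pre-aligned, equal-length strings position by position, the sequences A, B and C are invented, and the input order stands in for the heuristic choice of candidate representatives.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("toy identity expects pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, min_identity=0.9):
    """Assign each sequence to the first representative it matches.

    seqs maps names to sequences; iteration order decides which
    sequences become representatives (an assumption of this sketch).
    """
    clusters = {}  # representative name -> member names
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            # No representative was close enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Invented sequences: A-B and A-C are 90% identical, B-C only 80%.
toy = {"A": "MKTLLVAGAA", "B": "MKTLLVAGAS", "C": "MRTLLVAGAA"}

a_first = greedy_cluster(toy)
b_first = greedy_cluster({k: toy[k] for k in "BAC"})
print(a_first)
print(b_first)
```

With A considered first, all three sequences end up in one cluster; with B considered first, C no longer meets the threshold against the representative and forms its own cluster.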

Changes made:
We now describe the Linclust algorithm as a heuristic single-linkage approach in Line 292 (novel text in bold) of Methods to make the clustering process easier to understand: "Then we hierarchically clustered the non-singletons at 90% amino acid identity and 90% coverage using Linclust with the following parameters: -c 0.9, --min-seq-id 0.9. Linclust is a single-linkage approach, whereby sequences are clustered together if they share a common representative with candidate representatives being chosen heuristically."

The procedure used to evaluate the significance of the clusters appears convoluted, undersampled and arbitrary. Why not use a simple bootstrapping approach to evaluate the robustness of the clusters? This is a lot more standard and typical for such analyses.
Author response: Unfortunately, even after consulting with colleagues, we are not sure what simple bootstrapping procedure the reviewer may have had in mind. Perhaps we had not explained the purpose of the cluster evaluations sufficiently: our major concern was that, even though we are using a well-established pipeline for clustering (Linclust by Steinegger & Söding, 2018, see above), this pipeline was developed and benchmarked for canonical-length proteins. Therefore, we feared that some results (e.g., the fact that we observe a relatively large fraction of singleton clusters) could be due to us using it inappropriately (namely on small sequences). We wanted to estimate the rate of false negatives (i.e., sequences that were marked as singleton even though they should have been clustered with another one) and false positives (sequences that are members of a cluster even though they do not belong there). It is impossible to perform an exhaustive search for the whole catalogue, so we applied an exhaustive search method to a small, randomly chosen sample to estimate these false negative/false positive rates.

Changes made:
The corresponding section in the Methods (Line 301, novel text in bold) now reads: "Of these clusters, 47.5% contain a single sequence (singleton clusters). To rule out the possibility that this was due to the fact that Linclust is a heuristic method that is not specifically designed for short sequences, we estimated the rate of false negatives (i.e., sequences that were marked as singleton even though they should have been clustered with another one). We aligned a randomly selected 1,000 singleton clusters against the representative sequences of non-singleton clusters (i.e., those containing ≥ 2 sequences) using SWIPE with the following parameters: -a 18 -m '8 std qcovs' -p 1. The alignment threshold was E-value < 10⁻⁵, identity ≥ 90%, and coverage ≥ 90% (Supplementary Fig. 1a).
In addition, to estimate the rate of false positive clusterings (sequences that were assigned to a cluster even though they do not share the required identity with the cluster representative), 1,000 sequences were randomly selected and aligned against the representative sequences of their clusters using SWIPE with the following parameters: -a 18 -m '8 std qcovs' -p 1. The alignment threshold was E-value < 10⁻⁵, identity ≥ 90%, and coverage ≥ 90% (Supplementary Fig. 1b)."
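
For a sample of this size, the uncertainty of an estimated rate can be gauged with a binomial confidence interval. The sketch below computes a Wilson score interval in Python; it is an illustration rather than part of the analysis described above, and the count of 50 hits out of 1,000 is a hypothetical placeholder, not a result.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits/n.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical example: 50 of 1,000 sampled sequences misassigned.
low, high = wilson_interval(50, 1000)
print(f"estimated rate 0.050, interval {low:.3f} to {high:.3f}")
```

The Wilson interval is used here because it remains well behaved when the observed rate is close to zero, which is the regime of interest when checking for rare clustering errors.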

Figure 1. Screenshots of the updated website showing the quality information. (a) Quality filtering interface when browsing/searching. (b) Results for a single cluster, showing details.

The smORFs family construction requires a more detailed explanation. What are the sequences that are clustered? Is it that any sequence that has at least one other sequence with which it has a 90% identity and 90% coverage is used as input for the clustering analysis? You could have three sequences named A, B and C, with A and B having a 90% identity, A and C also having a 90% identity, but B and C not having this level of identity. Would these all be grouped together?